ABSTRACT

Here, we present APPRIS (http://appris.bioinfo.cnio.es),
a database that houses annotations of human
splice isoforms. APPRIS has been designed to 
provide value to manual annotations of the human 
genome by adding reliable protein structural and 
functional data and information from cross-species 
conservation. The visual representation of the anno- 
tations provided by APPRIS for each gene allows 
annotators and researchers alike to easily identify 
functional changes brought about by splicing 
events. In addition to collecting, integrating and 
analyzing reliable predictions of the effect of 
splicing events, APPRIS also selects a single refer- 
ence sequence for each gene, here termed the prin- 
cipal isoform, based on the annotations of structure, 
function and conservation for each transcript. 
APPRIS identifies a principal isoform for 85% of 
the protein-coding genes in the GENCODE 7 
release for ENSEMBL. Analysis of the APPRIS data 
shows that at least 70% of the alternative (non- 
principal) variants would lose important functional 
or structural information relative to the principal 
isoform. 



INTRODUCTION 

Genes can generate a wide range of mature RNA variants 
through the alternative splicing process (1,2). In fact, 
studies have revealed that virtually all multi-exon human 
genes (3,4) are capable of producing at least two RNA 
transcripts by alternative splicing. Alternative splicing 
events that occur within coding regions will produce alter- 
native transcripts that have the potential to be translated 



into distinct gene products. Those alternative transcripts 
that are not picked up by the cellular surveillance machin- 
ery, such as nonsense-mediated decay (NMD; 5,6), 
non-stop decay (7) and no-go decay (8), may contribute 
to an increase in the complexity of the cell. Alternative
gene products often have a surprising level of diversity 
(9) and can therefore have very different biological and 
cellular properties. This has led to the suggestion that the
rearrangement of exons by alternative splicing
may enrich the repertoire of cellular functions (10).

Genome annotation projects are producing gene sets 
saturated with alternative splicing variants (11). If alternative
splicing does have the potential to expand the
cellular functional repertoire in eukaryotic species, it 
would seem to be important to assign roles to these 
splicing variants. However, the sheer quantity of 
genomic data generated by these projects (12,13) 
presents serious challenges for functional annotation. At 
present, almost all the experimental information related to 
alternative coding variants has been generated for RNA 
transcripts rather than protein isoforms. Despite the fact 
that there is only piecemeal experimental data for the 
cellular role of alternative isoforms, it is possible to 
predict the likely biological effects of alternative splicing.

APPRIS has been developed within the GENCODE 
consortium (14) to annotate alternative gene products 
with reliable, biologically relevant data. GENCODE
provides high-accuracy manual annotations of protein- 
coding loci and alternative variants as part of the 
ENCODE project (12,15). The GENCODE annotations 
are gradually replacing and augmenting the Ensembl (16) 
automatic annotations. As part of the GENCODE 
annotation process, APPRIS flags isoforms with likely
altered structure, function or localization, and exons 
that are evolving unusually. The information from 
APPRIS is fed back to the manual annotators and has 
led to the annotation of new isoforms.
APPRIS annotates variants with protein structural and
functional information, signal peptides and trans- 
membrane helices, conservation of related species, the 
conservation of exonic structure and exon evolutionary 
rates. APPRIS currently annotates the GENCODE/ 
Ensembl merge of the human genome. 

The novel feature of APPRIS is that it selects a principal
isoform for each gene based on the reliable annotations
for protein structure, function and cross-species
conservation. The principal isoform is the representative 
isoform of the gene, the isoform against which all other 
(alternative) isoforms should be compared. In APPRIS, 
the principal isoform is the isoform with the main 
cellular function, the isoform that is expressed in the 
majority of tissues or in most stages of development or
the isoform that is the most evolutionarily conserved. The
process of selecting principal isoforms is illustrated with 
biologically relevant examples in the 'APPRIS annotation' 
section. 

It is particularly difficult to automate the selection of a 
single representative for a gene, and all large-scale
genomic analyses and databases such as Ensembl (16) 
and SwissProt (17) get round this problem by simply 
selecting the longest isoform as the main variant. 
Although this is a safe choice, and is often correct, we 
have shown that it is not always the best strategy: only
~75% of the isoforms selected by this strategy are likely to
be principal (18). 

We performed an initial study on the feasibility of
selecting principal isoforms using a number of prediction 
methods (18). The methods used to pinpoint principal 
functional isoforms were based on conservation and the 
characteristics of known proteins, principally structural 
and functional features. We determined a principal 
variant for 179 of the 215 human genes in the study, 
83% of genes with multiple alternative variants. Where 
the principal variants selected in the study differed from 
the SwissProt display sequences, we found annotation 
evidence from cross-species alignments that supported
our selection over the SwissProt display sequence. 

Based on this initial study, we developed APPRIS. 
APPRIS is made up of eight separate annotation 
modules, each with a specific role. For example, firestar 
(19) predicts the presence of individual functionally im- 
portant residues in splice variants, Matador3D predicts 
the effect of splicing events on 3D structure and 
INERTIA detects exons that are undergoing unusual 
evolution. 



THE DATABASE 

APPRIS was developed using version 3c of the 
GENCODE annotation (Ensembl 56), which was the 
initial Ensembl/GENCODE merge, and currently runs 
GENCODE version 7 (Ensembl 62). Between 
GENCODE 3c and 7, the annotation was cleaned of 
~2000 genes (mostly automatic annotations removed by 
Ensembl), while more than 10 000 new annotated coding
variants were added. GENCODE release 7 recognizes
20 687 protein-coding genes and 84 408 distinct coding



transcripts. APPRIS is updated with each new stable 
GENCODE release and is currently being updated to 
GENCODE version 12. 

The APPRIS system (Figure 1) is composed of eight 
separate annotation modules. These eight modules do 
not comprise an exhaustive list of all possible protein
features. Instead, the methods used in APPRIS were
chosen for their ability to select principal isoforms. Each
method either detects the absence of highly conserved 
protein features (as highly conserved protein features are 
extremely unlikely to have arisen by chance, we can 
discard variants that lack these features) or calculates 
cross-species conservation (the more conserved an exon/transcript,
the more likely that it represents the principal
variant). None of the computational methods behind the
modules is new; all have been published previously. Instead, the methods
have either been combined in novel ways or have been
adjusted especially for APPRIS, and the output of all
the methods has been tuned to keep false-positive predictions
to a minimum, albeit at the expense of coverage.

The eight modules are as follows (see Supplementary 
Data for further details). Matador3D checks for the 
presence of structural homologs in the PDB (20) and 
tests the integrity of the 3D structure; firestar (19) makes
highly reliable predictions of conserved functionally important
amino acid residues; SPADE uses the program
Pfamscan (21) to count conserved and compromised
Pfam functional domains; INERTIA uses three alignment
methods (22-24) to generate cross-species alignments,
from which SLR (25) identifies exons with unusual evolutionary
rates; CRASH makes conservative predictions of
signal peptides using the SignalP and TargetP programs
(26); THUMP generates conservative predictions of trans-membrane
helices from three separate trans-membrane
predictors (27-29); CExonic uses exonerate (30) to align
mouse and human transcripts and looks for patterns of
conservation in exonic structure; and CORSAIR uses
BLAST (31) to map vertebrate orthologs to each variant
and counts the numbers of orthologs that align correctly
and without gaps. All of these methods are available as
web services. 

Annotation and selection of principal isoforms 

In addition to annotating alternative isoforms with biolo- 
gical data, APPRIS selects a principal isoform from 
among the isoforms annotated for each gene. The selec- 
tion of a principal isoform by APPRIS is based on two 
principles. The first principle is that there is often one 
isoform that performs the main cellular function or that 
is expressed in the majority of tissues or in most stages of
development, and that the rest of the annotated isoforms 
are alternatively spliced isoforms that may perform 
distinct roles. Proteomics evidence suggests that this 
seems to hold true for many genes (32-37) although 
there are genes for which it is more difficult to define a 
principal isoform precisely because there are a number of 
isoforms that might be regarded as equally important for 
cellular function. 

The second principle is that the principal isoform
should have more evolutionary history. The principal
isoform ought to be the variant that is most conserved
across related species. It has been shown that alternative
exons tend to have evolved more recently (38), so this is a
reasonable assumption.

Figure 1. The APPRIS system. The inputs to the web services are the peptide sequences of the isoforms, the sequences of the transcripts and
cross-species alignments of either transcripts or peptides. The output of all eight individual web services is used by APPRIS to annotate alternative
variants with structural, functional and conservation information and to select a principal isoform for each gene.

The methods that make up APPRIS detect unusual, 
missing or non-conserved features and will flag these tran- 
scripts as alternative. The selection of a principal isoform 
is based on a jury of the eight methods that make up the
pipeline. The isoform selected as principal will be either
the variant that has the most conserved protein features 
(since it is much more likely that alternative isoforms have 
lost rather than gained protein features such as 3D struc- 
ture and function) or that has more evidence of cross- 
species conservation, or, most frequently, both. Four 
methods (SPADE, CORSAIR, Matador3D and firestar) 
make up the core of the jury system, with the other 
methods becoming more important in cases where these 
four methods are not able to make a decision. 

It should be noted that GENCODE-coding transcripts 
are not all considered equally. First, transcripts with iden- 
tical CDS (in other words those that undergo alternative 
splicing only in 3'- or 5'-UTR) are regarded as identical for 
the purposes of selecting a principal protein isoform in 
APPRIS. Second, transcripts that are annotated as NMD 
targets are annotated with protein features, but cannot be 
selected as the principal variant by APPRIS. The same is 
also true of all transcripts annotated as fragments. 
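
The eligibility rules above can be sketched in a few lines of Python. This is an illustration only, not the APPRIS implementation: the `Transcript` class, its field names and the toy sequences are hypothetical, and real GENCODE records carry far more information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; field names are assumptions made for this sketch.
    transcript_id: str
    cds: str                  # coding sequence only, UTRs excluded
    is_nmd: bool = False      # annotated as an NMD target
    is_fragment: bool = False # annotated as a fragment


def principal_candidates(transcripts):
    """Group eligible transcripts by identical CDS.

    Returns a dict mapping each distinct CDS to the IDs of the
    transcripts that share it. NMD targets and fragments are skipped
    because they cannot be selected as the principal variant.
    """
    groups = {}
    for t in transcripts:
        if t.is_nmd or t.is_fragment:
            continue
        groups.setdefault(t.cds, []).append(t.transcript_id)
    return groups


# Toy gene with five coding transcripts (sequences are made up).
gene = [
    Transcript("T-001", "ATGGCCTAA"),
    Transcript("T-002", "ATGGCCTAA"),     # differs from T-001 only in the UTRs
    Transcript("T-003", "ATGGCCGGTTAA"),
    Transcript("T-004", "ATGTAA", is_nmd=True),
    Transcript("T-005", "ATGGCC", is_fragment=True),
]
candidates = principal_candidates(gene)
```

In this toy gene, T-001 and T-002 collapse into a single candidate and T-004 and T-005 are excluded, so two distinct protein isoforms remain for the jury to compare.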

For the few cases where the methods in APPRIS tag all 
the variants as alternative (in most cases these are genes 
with 'read-through' transcripts), the gene is brought to the 
attention of the GENCODE manual annotators. 



A list of principal isoforms selected by APPRIS for each
version of GENCODE/Ensembl is available. In the few 
cases where APPRIS is not able to determine the main 
isoform, the variant with the longest protein sequence is 
selected from among those isoforms not rejected 
by APPRIS. 
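
The decision cascade described in this section (core jury first, then the remaining modules, then the longest unrejected isoform) might be sketched as follows. The text does not give the actual scoring or voting scheme, so this sketch assumes, purely for illustration, that a single flag from a module is enough to set an isoform aside; the function name, the input layout and the returned labels are likewise hypothetical.

```python
CORE_MODULES = frozenset({"SPADE", "CORSAIR", "Matador3D", "firestar"})
OTHER_MODULES = frozenset({"INERTIA", "CRASH", "THUMP", "CExonic"})


def select_principal(isoforms):
    """Pick a principal isoform from module flags (illustrative only).

    `isoforms` maps an isoform ID to a dict with two keys:
      "length"     - protein length in residues
      "flagged_by" - set of module names that tagged it as alternative
    Returns (isoform_id_or_None, reason).
    """
    pool = sorted(isoforms)

    # Step 1: the four core modules vote first.
    core_ok = [i for i in pool if not isoforms[i]["flagged_by"] & CORE_MODULES]
    if not core_ok:
        # Every variant was tagged as alternative: refer to annotators.
        return None, "manual review"
    if len(core_ok) == 1:
        return core_ok[0], "core jury"

    # Step 2: the other modules matter only when the core cannot decide.
    all_ok = [i for i in core_ok if not isoforms[i]["flagged_by"] & OTHER_MODULES]
    if len(all_ok) == 1:
        return all_ok[0], "remaining modules"

    # Step 3: fall back to the longest isoform among those still standing.
    remaining = all_ok or core_ok
    longest = max(remaining, key=lambda i: isoforms[i]["length"])
    return longest, "longest unrejected isoform"


example = {
    "iso1": {"length": 300, "flagged_by": set()},
    "iso2": {"length": 350, "flagged_by": {"SPADE"}},
}
choice = select_principal(example)
```

Here the longer isoform is flagged by SPADE, so the shorter one is chosen by the core jury. The design point is that length is consulted only as a last resort.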

APPRIS will be updated with the Ensembl/GENCODE 
database and will be extended to cover mouse gene models 
in the near future as the GENCODE annotators focus on 
mouse models, although in theory APPRIS could be 
extended to incorporate data from any well-annotated 
eukaryotic genome. 

System architecture and user notes 

The APPRIS web site allows the user to search genes and 
transcripts and displays six panels of annotations. The 
first panel shows all the GENCODE/Ensembl-coding 
variants and highlights the main functional variant. The
second panel shows the APPRIS annotations in detail and 
includes information such as the number of functional 
residues detected by firestar, the Matador3D homologous 
structure score or the number of vertebrate species that 
align in CORSAIR. The next two panels map the APPRIS
annotations onto the amino acid sequences of all coding 
variants and make the annotations visible in the genome 
regions provided by the UCSC Genome Browser (39). 
Finally, there are panels that allow the user to compare 
and contrast proteomic (37,40) and RNAseq (4) evidence 
tracks against the APPRIS annotation tracks in the UCSC 
Browser (see Supplementary Data for more details). 
APPRIS has been designed to be portable, modular and
flexible and it can be accessed as web services. These services 
can retrieve the results of the execution of APPRIS methods 
and other useful information for genes/transcripts. Plain 
text, JSON/GTF format or BED format (which facilitates 
the visualization of annotation tracks across genomic 
regions) outputs are available. In addition, APPRIS 
supports the downloading of data through the highly cus- 
tomizable BioMart (41) data mining tool. 
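
To illustrate why BED output is convenient for genome-browser tracks, the sketch below converts a 1-based, inclusive interval (the convention used by GTF) into a six-column BED line, which uses a 0-based start and an exclusive end. The function name, the feature label and the coordinates are hypothetical; the exact columns emitted by the APPRIS services are not specified here.

```python
def to_bed_line(chrom, start_1based, end_1based, name, score=0, strand="+"):
    """Format one annotation as a six-column BED line.

    BED stores a 0-based start and an exclusive end, so a 1-based
    inclusive interval [start, end] becomes [start - 1, end).
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    if not 0 <= score <= 1000:
        raise ValueError("BED scores range from 0 to 1000")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = (chrom, start_1based - 1, end_1based, name, score, strand)
    return "\t".join(str(f) for f in fields)


# Hypothetical annotated region; coordinates are invented for the example.
bed_line = to_bed_line("chr2", 1000, 1200, "GENE-001:SPADE_domain")
```

The resulting tab-separated line can be pasted into a custom track, which is what makes the annotations viewable alongside other evidence in a genome browser.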

APPRIS uses a MySQL relational database to store the 
information that can be downloaded from the APPRIS
web site. A comprehensive set of application programming
interfaces (APIs) serves as a middle layer between applications
and the underlying database schemas. The APIs encapsulate the database
layout by providing efficient high-level access to data 
tables and isolate applications from data layout changes. 



APPRIS ANNOTATIONS 

APPRIS identifies a principal isoform for the majority of 
human genes. APPRIS determines a principal isoform for 
17 731 (85.7%) of the 20 687 protein-coding genes in the 
GENCODE 7 (Ensembl 62) release (Supplementary 
Figure S4). A total of 53 307 variants were tagged as 
alternative by the methods in APPRIS and 22 799 tran- 
scripts were identified as principal isoforms (the discrep- 
ancy between genes and transcripts is because many 
transcripts are only alternatively spliced in the 3'- or 
5'-regions and are regarded as identical by APPRIS 
because they have identical coding sequences). 

Many of the isoforms recognized by APPRIS as alter- 
native are likely to have substantial changes to their struc- 
ture and function. A total of 37 550 alternative splice 
variants (70.4% of the variants tagged as alternative) 
would lose important functional or structural information 
relative to the principal isoform. The conservative esti- 
mates from the APPRIS methods show that 15 087 
variants (28.3%) would lose important functional 
residues, 31 169 alternative gene products (58.5%) would 
have damaged or lost Pfam functional domains and 26 955
alternative isoforms (50.6%) would lose a substantial part 
of their 3D structure. 
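
The percentages quoted in this section follow directly from the stated counts. The short check below uses only numbers given in the text; the variable names are ours.

```python
TOTAL_GENES = 20_687        # protein-coding genes in GENCODE 7
TOTAL_ALTERNATIVE = 53_307  # variants tagged as alternative

# Counts of alternative variants, as stated in the text.
alternative_counts = {
    "any functional or structural loss": 37_550,
    "functional residues lost": 15_087,
    "Pfam domains damaged or lost": 31_169,
    "3D structure substantially lost": 26_955,
}

percentages = {
    label: round(100 * count / TOTAL_ALTERNATIVE, 1)
    for label, count in alternative_counts.items()
}
gene_coverage = round(100 * 17_731 / TOTAL_GENES, 1)

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
print(f"genes with a principal isoform: {gene_coverage}%")
```

Each value reproduces the corresponding figure in the text to one decimal place.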

More than 8175 of the annotated transcripts would lose 
at least one trans-membrane helix and 543 would have lost
a signal sequence. Almost 50 000 alternative splice variants 
(49 899, 94.5% of variants tagged as alternative) were less 
conserved across related species than the principal variant 
(from the results of CORSAIR, CExonic or INERTIA).

The CCDS project (42) is identifying consistently 
annotated, high-quality protein-coding variants for the 
human genome. CCDS variants are annotated only 
when there is agreement between the three main public 
annotation resources (GENCODE/Ensembl, NCBI and 
UCSC). Although the CCDS project can annotate any 
number of variants for a gene, many genes have a single 
CCDS variant, a variant agreed upon by all annotation 
resources. A single CCDS variant is the closest thing to an 
APPRIS principal variant; we should therefore expect to
see high agreement between the APPRIS constitutional 
isoforms and the CCDS variants. 



For those genes that have multiple isoforms and a single 
CCDS variant, APPRIS is in agreement with the CCDS 
variant 93.5% of the time. What is more, this rises to over 
96% for the core modules. This compares to an agreement 
of just 79.2% for the strategy of selecting the longest 
isoform (see Supplementary Table S1).
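
The comparison above amounts to an agreement rate over genes that have a single CCDS variant. A minimal sketch of that calculation is given below; the gene and isoform identifiers are toy data invented for the example, and the function is not part of APPRIS.

```python
def agreement_rate(selection, single_ccds):
    """Percentage of single-CCDS genes where `selection` picks the CCDS variant.

    Both arguments map a gene ID to an isoform ID. Genes missing from
    `selection` are ignored rather than counted as disagreements.
    """
    shared = [gene for gene in single_ccds if gene in selection]
    if not shared:
        raise ValueError("no genes in common")
    hits = sum(selection[gene] == single_ccds[gene] for gene in shared)
    return 100.0 * hits / len(shared)


# Toy data: four genes, each with exactly one CCDS variant.
single_ccds = {"G1": "a", "G2": "b", "G3": "c", "G4": "d"}
by_method = {"G1": "a", "G2": "b", "G3": "c", "G4": "x"}
by_longest = {"G1": "a", "G2": "y", "G3": "c", "G4": "x"}

method_rate = agreement_rate(by_method, single_ccds)
longest_rate = agreement_rate(by_longest, single_ccds)
```

Running the same function over two selection strategies gives directly comparable rates, which is how figures such as 93.5% and 79.2% can be set side by side.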

Two examples (of many) serve to illustrate the utility of
APPRIS in the selection of principal isoforms. In the first 
example (gene DNAJC5G), APPRIS disagrees with the 
CCDS annotation by selecting an isoform that is 16 
residues shorter than the pair of protein sequence identical 
isoforms chosen as the single CCDS variant, as the 
Ensembl reference sequence and as the SwissProt display 
sequence (Figure 2A). The variant selected by APPRIS 
(DNAJC5G-004) has a better score in Matador3D 
(it maps better to the known 3D structures in the PDB) 
and has a conserved Pfam domain. In contrast, the longer 
sequences would have broken Pfam domains and 3D 
structure (Figure 2B). The extra exon in the CCDS 
variant generates a 16-residue insertion that would be 
likely to disrupt a 3D structure (Figure 2C) and a
conserved Pfam domain (Figure 2D). 

The second example concerns the TP63 gene. There are 
two well-studied isoforms for this gene, TA-alpha (43) and 
deltaN (44). They are generated from different translation 
start sites and generate different N-terminals (Figure 3A). 
However, rather than elect one of these two, APPRIS 
gives the best score to variant TP63-013, a 582-amino 
acid protein. Although this result might be surprising at 
first glance, it is perfectly logical. 

TP63 is annotated with 15 coding variants in 
GENCODE 7, and all but 4 (TA-alpha, deltaN, 
P63delta and TP63-013) are rejected as potential principal 
isoforms by the SPADE (Pfam domains) and Matador3D 
(3D structure) modules in APPRIS. P63delta and 
TP63-013 are generated from TA-alpha and deltaN, 
respectively, by a known GYNGYN splicing event (45) 
that results in a swap of five amino acids 'GTKRP' for 
a single alanine in a non-critical region of the protein 
(Figure 3A). CORSAIR (vertebrate sequence database in- 
formation) separates these four variants based on align- 
ments with isoforms from other species (Figure 3B). 
It turns out that there is more ortholog evidence for 
deltaN and its GYNGYN variant TP63-013 than there 
is for TA-alpha and P63delta (the deltaN splice event is 
conserved as far back as chicken and Danio). There is also 
more CAGE data support for the deltaN/TP63-013 trans- 
lation start site (A. Frankish, personal communication). 

In addition, CORSAIR selects TP63-013 ahead of 
deltaN because Danio isoforms in the sequence databases 
have the single alanine instead of the GTKRP motif. In 
fact, the 3D structure of this region of the protein has also 
been solved for the isoform with the single alanine, adding 
weight to the APPRIS selection. 

These examples neatly demonstrate the process behind
APPRIS and reinforce the idea that it is possible to designate
a principal isoform based on protein features and
evolutionary antiquity, even where two isoforms (or more 
as in the case of TP63) have clearly defined functional 
roles. 

Figure 2. APPRIS annotations for gene DNAJC5G. (A) Snapshot of the APPRIS page for gene DNAJC5G, showing the five protein-coding
transcripts annotated by Ensembl and the selection of the principal isoform by APPRIS (shown by the green tick). (B) A selection of the annotation 
results from the individual modules in APPRIS (Matador3D maps 3D structure to the isoforms, SPADE maps Pfam functional domains and 
INERTIA detects unusually evolving exons). The principal isoform is highlighted. APPRIS chooses isoform DNAJC5G-004 as the main variant 
based on the output of SPADE and Matador3D and designates the two 'KNOWN' isoforms (which are also CCDS variants) as alternative variants. 
(C) The 3D structure of mouse DNAJ subfamily C2 member 5 (PDB: 2CTW), to which DNAJC5G-004 has 56% identity with no gaps. The coloring 
on the 3D structure comes from the predominant coloring in the Pfam multiple alignment in (D). The large red arrow shows that the 16 extra 

Figure 3. APPRIS result for gene TP63. (A) The four variants of gene TP63 that score highest in APPRIS, highlighting the 5'-splice junction
differences between TA and P63delta, deltaN and TP63-013, and the GYNGYN splice event that differentiates TA and deltaN from P63delta and
TP63-013. (B) Annotation results from some of the individual modules in APPRIS (Matador3D maps 3D structure to the isoforms, SPADE maps
Pfam functional domains and CORSAIR detects orthologous isoforms in related species). The isoforms are color-coded as in (A) and we have added 
two other well-known TP63 variants to the table. APPRIS chooses isoform TP63-013 as the main variant based on the output of the three modules. 



DISCUSSION 

APPRIS deploys a range of computational methods to 
annotate alternative isoforms with protein structural and 
functional information and to evaluate cross-species conservation.
The database provides reliable functional annotations
for the most recent version of the manual
annotation of the human genome. The APPRIS annota- 
tions will allow genome annotation groups and individual 
researchers to track the effect of alternative splicing events
on individual splice isoforms. 

There are already a number of databases that can 
annotate alternative transcripts with some of these 
features (46-48). What sets APPRIS apart from all these 
databases is that APPRIS provides high-quality annota- 
tions that are being used in the annotation of the human 
genome and that these annotations are used to select a 
principal isoform for each gene. Principal isoforms are 
selected based on evolutionary evidence in the form of 
conserved functional and structural motifs and cross- 
species conservation. The success of APPRIS is due to 
the observation that most alternative isoforms lack 
regions of conserved structure or function, or have 
exons that are evolving at measurably different rates 
compared with their principal counterparts. The 
APPRIS database has been able to identify a principal 
isoform for the majority of human genes (85%). 

APPRIS is the first database to include principal 
isoforms on a genome-wide scale. Previously, all
databases and large-scale studies have had to resort to selecting
the largest annotated isoform as the reference
variant. We have shown that this conservative solution 
is not ideal (18). The lack of reliably identified principal 



isoforms in annotation databases is an omission that is 
only going to become more glaring with time as the 
numbers of annotated variants in the sequence databases 
grow. 

At present, most computational methods (49) and data- 
bases (21) are based on the assumption that a single 
isoform represents each gene. The SwissProt database, 
for example, combines all variants of the same gene in a 
single entry. These entries include experimental data and 
predictions, which are widely referenced from a number of 
external sources. One of the sequences in each entry, 
almost always the longest, is designated as the display 
sequence and the remaining sequences are included as 
splice variants. External databases and methods that use 
SwissProt as their standard often ignore these alternative 
sequences. If databases are going to condense gene 
products from the same gene into a single entry for tech- 
nical reasons, it is better that the sequence that 'represents' 
the gene is the APPRIS principal isoform. 

APPRIS principal isoforms have a wide range of uses 
and are applicable in all fields of research. Determining a 
principal isoform is important for research groups 
studying individual genes, since it is vitally important for 
designing experimental work. Researchers need to be able 
to work with the isoform that is most likely to have major
functional activity, and this is not always clear for all
genes. The designation of a single variant as the principal
isoform is a critical first step for any genome analysis. For
example, studies of cancer mutations (50) would be able to
use APPRIS data to determine whether the mutations are
in principal or alternative exons, and proteomics studies
could use APPRIS data to decide whether a peptide
would be generated from an alternative or principal
exon. Since automatic prediction methods rely on the
quality of input data, starting from the principal isoform 
should allow groups to perform more reliable studies. 
Finally, the selection of a principal isoform also serves 
as a starting point for investigations into the functional 
potential of alternative isoforms. These are just a few 
examples; the potential for the use of the APPRIS data 
in research is huge. 

APPRIS is currently being used to annotate protein- 
coding genes by annotators in the GENCODE consortium 
(14) and the CCDS project (42). Annotations based on 
APPRIS data are already percolating to many users 
through these databases. We hope that the APPRIS prin- 
cipal isoforms will become accepted as the standard refer- 
ence sequence for each gene. We believe that the principal 
isoforms identified by APPRIS are a significant advance 
on the current practice of selecting the longest variant 
as the reference isoform and that they should be used in 
all automatic genome-wide protocols and large-scale 
analyses. 

The APPRIS annotations and the list of principal 
isoforms are accessible to all and are available for 
download in a range of formats. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1, Supplementary Figures 1-4, 
Supplementary Methods, Supplementary Results and 
Supplementary References [51-57]. 